ABSTRACT

Seq2Logo is a web-based sequence logo generator. 
Sequence logos are a graphical representation of 
the information content stored in a multiple 
sequence alignment (MSA) and provide a compact 
and highly intuitive representation of the position- 
specific amino acid composition of binding motifs, 
active sites, etc. in biological sequences. Accurate
generation of sequence logos is often compromised
by sequence redundancy and a low number of observations.
Moreover, most methods available for
sequence logo generation focus on displaying the
position-specific enrichment of amino acids,
discarding the equally valuable information related
to amino acid depletion. Seq2Logo aims at resolving
these issues by allowing the user to include sequence
weighting to correct for data redundancy,
pseudo counts to correct for a low number of observations
and different logotype representations,
each capturing different aspects of amino
acid enrichment and depletion. Besides
allowing input in the format of peptides and 
MSA, Seq2Logo accepts input as Blast sequence 
profiles, providing easy access for non-expert 
end-users to characterize and identify functionally 
conserved/variable amino acids in any given 
protein of interest. The output from the server is a 
sequence logo and a PSSM. Seq2Logo is available at 
http://www.cbs.dtu.dk/biotools/Seq2Logo (14 May 
2012, date last accessed). 



INTRODUCTION 

The idea of generating a logo from aligned sets of sequences 
was introduced in 1990 by Schneider and Stephens (1). The 
intention of a sequence logo is to concentrate into a single 
plot the general consensus, the order of predominance 
of residues at every position, the relative frequencies of 
every residue at every position, the amount of information 
present at every position and significant locations. This 
logo is then able to present all of the relevant information 
to the viewer in a fast and concise manner. 

Several webservers exist to generate sequence logos from
MSAs (2-5). All these servers suffer from different limitations
in the handling of sequence redundancy and low numbers
of observations. Moreover, to the best of our knowledge, all
public sequence logo servers, with the exception of the
Icelogo (4) and two-sample logo (5) methods, focus on displaying
the position-specific enrichment of amino acids, discarding
the equally valuable information related to amino
acid depletion. Seq2Logo aims at resolving these issues by
allowing the user to include sequence weighting to correct
for data redundancy, pseudo counts to correct for a low
number of observations (6-8) and five different logotype
representations, each capturing different aspects of
amino acid enrichment and depletion. In addition to the
usual Shannon logo (9), Seq2Logo includes the option to
create Kullback-Leibler (KL) (10) logos where the
depleted (under-represented) amino acids are represented
on the negative y-axis. Besides the conventional KL logo,
Seq2Logo can also display a weighted KL logo, where the
relative height of each amino acid is proportional to the
log-odds ratio, and a probability weighted KL logo, where
the relative height of each amino acid is proportional to the
product of the probability and log-odds ratio. Finally,
inspired by the work of Fujii et al. (11), Seq2Logo also
includes an option to visualize PSSM (position-specific
scoring matrix) logos, where the height of a bar is given by
the sum of the absolute values of the PSSM weight matrix
scores and the height of a given amino acid is proportional
to the absolute value of its weight matrix score. In particular,
the weighted KL logo provides a visual and highly intuitive
representation of both amino acid enrichment and
depletion in, for instance, receptor binding motifs. Besides
allowing input in the format of peptides and MSAs, the
Seq2Logo server accepts inputs such as Blast sequence
profiles, providing easy access for non-expert end-users to
characterize and identify functionally conserved/variable
amino acids in any given protein of interest.



MATERIALS AND METHODS 

Seq2Logo implements two strategies to improve the
accuracy of the estimated sequence logo. The first
strategy is sequence weighting, which corrects for data redundancy.
The second strategy is pseudo counts, which
correct for a low number of observations. Sequence
weighting is implemented as described in (6,8) and
pseudo counts as described in (7). For details, see
Supplementary Data. 
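
As a rough illustration of the two corrections, the sketch below assumes a greedy, Hobohm1-style clustering in which every peptide in a cluster of size n receives the weight 1/n, and a pseudo count correction of the general form p = (alpha*f + beta*g)/(alpha + beta), where f are the weighted observed frequencies, g are pseudo frequencies and beta plays the role of the weight on prior. The function names, the identity measure, the weighting rule and the toy pseudo frequencies are assumptions made here for illustration; the procedures actually used by the server are those of (6-8) and the Supplementary Data.

```python
from collections import Counter

def identity(a, b):
    """Fraction of identical positions between two equal-length peptides."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_weights(peptides, threshold=0.63):
    """Greedy clustering: a peptide joins the first cluster whose
    representative it matches at >= threshold identity. Each member of a
    cluster of size n gets weight 1/n (assumed weighting rule)."""
    reps, members = [], []
    for i, pep in enumerate(peptides):
        for c, rep in enumerate(reps):
            if identity(pep, rep) >= threshold:
                members[c].append(i)
                break
        else:
            reps.append(pep)        # no similar cluster found: start a new one
            members.append([i])
    weights = [0.0] * len(peptides)
    for group in members:
        for i in group:
            weights[i] = 1.0 / len(group)
    return weights

def weighted_frequencies(peptides, weights, position):
    """Weighted amino acid frequencies at one alignment position."""
    total = sum(weights)
    freq = Counter()
    for pep, w in zip(peptides, weights):
        freq[pep[position]] += w / total
    return dict(freq)

def blend(f, g, alpha, beta):
    """Mix observed frequencies f with pseudo frequencies g."""
    keys = set(f) | set(g)
    return {a: (alpha * f.get(a, 0.0) + beta * g.get(a, 0.0)) / (alpha + beta)
            for a in keys}

# Toy input: the first two peptides differ at a single position.
peptides = ["KACDPHSGH", "KASDPHSGH", "ALAKAAAAV", "MLDPTLLLV"]
weights = cluster_weights(peptides)          # [0.5, 0.5, 1.0, 1.0]
freqs = weighted_frequencies(peptides, weights, 0)
```

In this toy case the two near-identical peptides share one unit of weight, so K, A and M each end up with a frequency of one third at the first position instead of K dominating.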

In a sequence logo, the height of the bar at each position is equal to
the information content at that position. The
information content is calculated using the relation
$I = \sum_a p_a \log_2(p_a/q_a)$, where $p_a$ and $q_a$ are the observed probability
(calculated from the data) and background probability,
respectively, of the amino acid $a$. If an equiprobable
background amino acid distribution is applied, a conventional
Shannon sequence logo is displayed. If a background
amino acid distribution reflecting the prevalence of the different
amino acids is applied, a Kullback-Leibler sequence
logo is displayed. The choice of the Kullback-Leibler
logotype in Seq2Logo not only provides correction for the
uneven distribution of amino acids, but also expresses the
depleted amino acids (where $p_a < q_a$) on the negative side of
the y-axis. This enables the user to quickly identify enriched
and depleted (under-represented) amino acids. To enhance
the identification of the depleted
amino acids, Seq2Logo includes another logotype called
weighted Kullback-Leibler. This logotype presents each
individual amino acid with a height proportional to its relative log-odds
score [$\log_2(p_a/q_a)$]. Another logotype, called
probability weighted Kullback-Leibler, is also included; here the relative
height of each individual amino acid is proportional to
$p_a \cdot \log_2(p_a/q_a)$. Finally, Seq2Logo includes an option to display
PSSM-logos (11), where the height of a bar is equal to the
sum of the absolute values of the PSSM weight matrix scores
and the height of each amino acid is proportional to the
absolute value of its weight matrix score (with negative
values displayed on the negative y-axis).
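
The relations above can be sketched for a single position as follows. The sketch assumes that the absolute letter heights of a stack sum to the information content I and that the letters share the stack in proportion to p_a (Shannon and Kullback-Leibler), |log2(p_a/q_a)| (weighted Kullback-Leibler) or p_a*|log2(p_a/q_a)| (probability weighted Kullback-Leibler), with depleted amino acids given a negative sign. This normalization, the function names and the three-letter toy alphabet are assumptions made for illustration.

```python
import math

def information(p, q):
    """I = sum_a p_a * log2(p_a / q_a), skipping amino acids with p_a = 0."""
    return sum(pa * math.log2(pa / q[a]) for a, pa in p.items() if pa > 0)

def letter_heights(p, q, logotype):
    """Signed letter heights (bits) for one position of a logo."""
    if logotype == "shannon":
        # Shannon logo: equiprobable background over the alphabet of q
        q = {a: 1.0 / len(q) for a in q}
    info = information(p, q)
    log_odds = {a: math.log2(pa / q[a]) for a, pa in p.items() if pa > 0}
    if logotype in ("shannon", "kl"):
        share = {a: p[a] for a in log_odds}
    elif logotype == "weighted_kl":
        share = {a: abs(lo) for a, lo in log_odds.items()}
    elif logotype == "p_weighted_kl":
        share = {a: p[a] * abs(lo) for a, lo in log_odds.items()}
    else:
        raise ValueError("unknown logotype: " + logotype)
    total = sum(share.values())
    heights = {}
    for a, s in share.items():
        size = info * s / total if total > 0 else 0.0
        # depleted amino acids (p_a < q_a) go below the x-axis, except in Shannon logos
        depleted = logotype != "shannon" and log_odds[a] < 0
        heights[a] = -size if depleted else size
    return heights

# Toy three-letter alphabet: A is depleted relative to its background.
q = {"L": 0.25, "V": 0.25, "A": 0.5}
p = {"L": 0.5, "V": 0.25, "A": 0.25}
kl = letter_heights(p, q, "kl")
wkl = letter_heights(p, q, "weighted_kl")
```

With these toy numbers I is 0.25 bits; in the Kullback-Leibler logo A is drawn below the axis with a quarter of the stack, whereas in the weighted Kullback-Leibler logo it takes half of the stack because its log-odds score is as large in magnitude as that of L.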



THE WEB SERVER 

The Seq2Logo server has a simple interface that allows 
non-expert users to generate and customize accurate 
logos from any amino acid sequence data of interest. 



Input 

The interface is split in two parts for easy overview. The 
first and the most important part is submission (Figure 1, 
left panel). Here, the user can upload or paste in the input 
data in addition to specifying the logotype (Shannon, 
Kullback-Leibler, Weighted Kullback-Leibler, probabil- 
ity weighted Kullback-Leibler or PSSM-logo) and condi- 
tions for handling the input data (sequence weighting and 
pseudo counts). Seq2Logo can read sequence data in the 
following formats: Fasta, ClustalW, Raw peptide 
sequences and Weight/Blast matrix (for details on each 
format refer to Supplementary Data). The detection of 
the format happens automatically through the identifica- 
tion of key elements from each format. In the submission 
part, the user further specifies which output files should be 
created. In the graphical layout part (Figure 1, right panel), the
user can customize the appearance of the logo plot.
Page size sets the resolution of the image, and stacks per
line and lines per page determine how the logo is laid
out. Amino acid colors are defined by assigning each
amino acid symbol to a color. There are six colors to
choose from: red, green, blue, yellow, purple or orange.
All amino acids left unassigned will be black. Several predefined
color schemes are available. The user can also rotate the
position numbers on the x-axis and hide various features 
of the graph. 
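
The automatic format detection mentioned above could, in principle, work along the following lines. This is a hypothetical heuristic written for illustration and not the server's actual rule set: it assumes that a Fasta file starts with '>', a ClustalW file with the word 'CLUSTAL', a Blast profile with the line 'Last position-specific scoring matrix computed', a weight matrix contains rows with at least 20 numeric fields, and raw peptides consist of letters only.

```python
import re

NUMBER = re.compile(r"^-?\d+(\.\d+)?$")

def detect_format(text):
    """Guess the input format from key elements (hypothetical heuristic)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    first = lines[0]
    if first.startswith(">"):
        return "fasta"
    if first.upper().startswith("CLUSTAL"):
        return "clustalw"
    if first.lower().startswith("last position-specific scoring matrix"):
        return "blast_matrix"
    # a weight matrix row carries one score per amino acid
    for ln in lines:
        if sum(bool(NUMBER.match(tok)) for tok in ln.split()) >= 20:
            return "weight_matrix"
    if all(ln.isalpha() for ln in lines):
        return "peptides"
    return "unknown"

print(detect_format("KLLIPVLLL\nALAKAAAAV"))  # peptides
```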

Output 

An example of the output from Seq2Logo generated using 
the input specifications from Figure 1 is shown in 
Figure 2. The figure shows, on the positive y-axis, the
amino acids enriched at each peptide position and, on
the negative y-axis, the corresponding depleted amino
acids. In this case, the logo is calculated from a set of 13
artificial peptide sequences proposed to bind the
HLA-A*02:01 class I major histocompatibility complex
(MHC) molecule. This molecule has a binding motif
with strong interactions at P2 and P9, both positions
with a preference for hydrophobic amino acids (12).

One of the distinct strengths of Seq2Logo is its ability to
deal with data redundancy and low numbers of observations.
To the best of our knowledge, no other public
sequence logo server shares this ability. Figure 3
illustrates how important these features are for the generation of accurate
sequence logos describing a binding motif.
The figure displays Shannon sequence logos generated
by Seq2Logo, using different options to improve the
accuracy, as well as sequence logos generated by
Weblogo (2) and EnoLOGOS (3). When comparing the
logos calculated from the small sample data set with the
logo obtained from the larger data set, it is apparent that
the inclusion of sequence weighting and pseudo counts
has a significant positive impact on the overall
accuracy of the binding motif description.

The other distinct feature of Seq2Logo compared
to most other public sequence logo servers is the display
of depleted amino acids on the negative y-axis in
Kullback-Leibler logos. Most sequence logo servers
display the relative height of the different amino acids 
in a manner proportional to their frequency, thus 
[Figure 1 (web interface screenshot). Submission panel: a paste field for the input (MSA in Fasta or ClustalW format, peptides, weight matrix or Blast matrix), here filled with the 13 example peptides, and a file upload option. Settings shown: logotype Kullback-Leibler; sequence weighting by clustering (Hobohm1) with threshold 0.63; weight on prior (pseudo counts) 200; unit type bits; output formats JPEG, PNG and PDF. Graphical layout panel: stacks per line 40; lines per page 3; page size 640x460; optional title; options to hide the axes, the y-axis label, the fineprint and the ends and to rotate the x-axis numbers; Seq2Logo default color scheme (red: DE; green: NQSGTY; blue: RKH).]



Figure 1. The submission (left) and graphical layout (right) part of the web interface. In the submission part the user specifies the input file, the 
format of output files, the logotype and the conditions for the handling of the input data. In the Graphical Layout part, the user customizes the 
graphical layout of the logo plot; page size, stacks per line, lines per page, colours, bars, rotation of position numbers and title. 



[Figure 2 (server output). Upper panel: the sequence logo, with download links for the logo and the weight matrix. Lower panel: the position-specific scoring matrix in Blast profile layout, one row per position giving the position number, the consensus amino acid and the log-odds scores in the column order A, R, N, ...]

Figure 2. Output from Seq2Logo. The upper panel shows the sequence logo calculated from a set of 13 artificial peptide sequences using the
specification defined in Figure 1 (sequence weighting using clustering, pseudo count with a weight of 200 and logotype as Kullback-Leibler).
Enriched amino acids are shown on the positive y-axis and depleted amino acids on the negative y-axis. The lower panel gives the position-specific
(log-odds) scoring matrix (PSSM) calculated by Seq2Logo. Each line corresponds to a position and gives the consensus amino acid and the log-odds
scores for the 20 amino acids.



Figure 3. Sequence logos generated from small sequence samples. All logos except the right logo in the lower row were calculated from a set of 13
artificial peptide sequences proposed to bind HLA-A*02:01 (see Figure 1). The upper row shows logos calculated by Seq2Logo using: (i) no
sequence weighting or pseudo count correction, (ii) sequence weighting by clustering and no pseudo count correction and (iii) sequence weighting by
clustering and pseudo count correction with a weight on prior of 200. The lower row shows logos calculated using: (i) Weblogo with 'small sample
correction', (ii) EnoLOGOS and (iii) Seq2Logo from a set of 229 HLA-A*02:01 9mer ligands downloaded from the SYFPEITHI database (12) with
sequence weighting by clustering and pseudo count correction with a weight on prior of 200.



displaying only the position-specific enrichment of amino
acids, discarding the equally valuable information related
to amino acid depletion. To improve on this issue,
Seq2Logo includes a series of distinct logotypes (see
Figure 4). In addition to the usual Shannon logo,
Seq2Logo includes the option to create Kullback-Leibler
(KL) logos where depleted amino acids are represented on
the negative y-axis. Besides the conventional KL logo,
Seq2Logo can also display a weighted KL logo, where
the relative height of each amino acid is proportional to
the log-odds ratio, and a probability weighted KL logo,
where the relative height of each amino acid is proportional
to the product of the probability and log-odds
ratio. In particular, the weighted KL logo provides a
visual and highly intuitive representation of both amino
acid enrichment and depletion in, for instance, receptor
binding motifs. Besides these information-based logotypes,
Seq2Logo offers the possibility of displaying
PSSM-logos calculated either from a log-odds weight
matrix derived by Seq2Logo from a multiple sequence
alignment or from a user-defined PSSM. In the
PSSM-logo, the height of the bar and of each amino acid at
each position is proportional to the absolute value of the
PSSM weight matrix values. This logotype is particularly
powerful when illustrating depletion of a small set of
amino acids from otherwise variable positions in a
sequence motif. One such example is N-linked
glycosylation sites, which are known to have the motif
N-X-S/T, where X can be any amino acid but
P. Visualizing this motif as an information-based
sequence logo will not capture the depletion of P at the
position between N and S/T, as all amino acids except
P are found at this position, hence making the overall
information content very small. On the other hand,
visualizing the motif as a PSSM-logo, the strong depletion
of P at the position between N and S/T becomes apparent
(see Figure 5). 
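
A small numerical sketch makes this argument concrete. It assumes an idealized X position in which P has a frequency of 0.001 (a stand-in for a pseudo count corrected zero), the other 19 amino acids are equally frequent, and the background is equiprobable. These numbers are illustrative assumptions, not values taken from the UniprotKB data set.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed toy distributions for the X position of N-X-S/T
q = {a: 1 / 20 for a in AMINO_ACIDS}                      # equiprobable background
p = {a: 0.999 / 19 for a in AMINO_ACIDS if a != "P"}      # 19 amino acids, equally frequent
p["P"] = 0.001                                            # strongly depleted proline

# Height of the stack in an information-based logo
info = sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS)

# Log-odds scores and stack height in a PSSM-logo
pssm = {a: math.log2(p[a] / q[a]) for a in AMINO_ACIDS}
bar = sum(abs(score) for score in pssm.values())
share_of_p = abs(pssm["P"]) / bar

print(round(info, 3))         # about 0.07 bits: the position looks uninformative
print(round(pssm["P"], 2))    # about -5.64: strong depletion of P
print(round(share_of_p, 2))   # about 0.8 of the PSSM-logo stack belongs to P
```

Under these assumptions the information-based stack is less than a tenth of a bit high, while in the PSSM-logo P alone accounts for roughly four fifths of the stack and is drawn on the negative y-axis.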

A powerful way to characterize sequence conservation/
variation within a protein family is by use of sequence
profiles. Such sequence profiles can be obtained using
Psi-Blast (7). Seq2Logo accepts such sequence
profiles in the Blast profile format, allowing easy access
for non-expert end-users to characterize and identify functionally
conserved/variable amino acids in any given
protein of interest. A Blast sequence profile can be generated
in-house using a command like 'blastpgp -d db -e
0.00001 -j 4 -Q blastprofile -i fasta -o out', where -d db is
the sequence database searched by Blast, -e defines
the e-value cut-off for significant hits, -j defines the
number of Psi-Blast iterations, -i is the input file in
FASTA format, -Q is the output file for the Blast
profile (the file to be used by Seq2Logo to visualize the
sequence profile) and -o is the file for the Blast output.
Alternatively, the Blast2logo webserver (www.cbs.dtu.dk/
biotools/Blast2logo (14 May 2012, date last accessed)) can
be used to obtain the sequence profile. Figure 6 demonstrates
the use of Seq2Logo to display a sequence profile
for Rhamnogalacturonan acetylesterase (PDBid 1K7C, 
chain A). The active site of 1K7C.A is defined by the 
residues S9, G42, N74, D192 and H195 (13). All these 
residues are highly conserved in the sequence logo (in 
fact they are among the 10 residues with the highest infor- 
mation content, data not shown). Another striking obser- 
vation from the logo is the lack of sequence information in 
the area between positions 75 and 105, suggesting that this 
part of the protein is highly variable (most likely an inser- 
tion) within the protein family. Both these observations 
illustrate the power of sequence profiles combined with 
Seq2Logo as a simple tool to identify functionally import- 
ant residues and insertions in protein sequences. 
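
For scripted use, the blastpgp call quoted above can be assembled programmatically. The sketch below only builds the argument list from the flags given in the text; the helper name and the file names are placeholders chosen here, and actually running the command requires a local installation of the legacy NCBI Blast package and a formatted sequence database.

```python
import shlex

def blastpgp_command(fasta, database, profile_out, blast_out,
                     evalue="0.00001", iterations=4):
    """Build the blastpgp argument list described in the text."""
    return [
        "blastpgp",
        "-d", database,          # sequence database to search
        "-e", evalue,            # e-value cut-off for significant hits
        "-j", str(iterations),   # number of Psi-Blast iterations
        "-Q", profile_out,       # Blast profile, the file to give to Seq2Logo
        "-i", fasta,             # query sequence in FASTA format
        "-o", blast_out,         # regular Blast output
    ]

# Placeholder file names; nr70 is the database named in the Figure 6 caption.
cmd = blastpgp_command("query.fasta", "nr70", "query.blastprofile", "query.out")
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.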
[Figure 4 panels: Shannon logo and Kullback-Leibler logo (upper row), each plotted over peptide positions 1-9.]



Figure 4. The different logotype representations covered by Seq2Logo. Sequence logos generated from a set of 13 artificial peptide sequences
proposed to bind HLA-A*02:01 (see Figure 1). All logos were calculated using clustering and pseudo counts with a weight on prior of 200.
Upper row, left panel: Shannon, right panel: Kullback-Leibler. Lower row, left panel: weighted Kullback-Leibler, right panel: probability
weighted Kullback-Leibler.







Figure 5. PSSM-logo for the N-linked glycosylation motif. The motif was calculated from a set of 2128 unique experimentally verified N-glycosylation
sites downloaded from the UniprotKB protein database. Only peptide fragments of length 11 (5 before and 5 after the N) were included in the
analysis.
Figure 6. Seq2Logo visualization of a Blast sequence profile for 1K7C chain A. The Blast profile was obtained using Blast2logo (www.cbs.dtu.dk/
biotools/Blast2logo (14 May 2012, date last accessed)) searching against the nr70 sequence database with default options. The active site of 1K7C:A
is defined by the residues S9, G42, N74, D192 and H195 (13). All these residues show up as highly conserved in the sequence logo.



INTEGRATING SEQ2LOGO WITH OTHER 
PREDICTION SERVERS 

To improve usability and enable Seq2Logo to cooperate
with other programs and servers, a form-handler
was implemented on the server that makes it possible to
send input data directly to Seq2Logo. This simple
form-handler allows a quick and easy transfer of data to
Seq2Logo and provides a platform for using Seq2Logo as a
visualization tool for other programs. The form data sent
to Seq2Logo are inserted directly into the input field.
Instructions on how to implement this transfer can be
found at: http://www.cbs.dtu.dk/biotools/Seq2Logo-L0/ 
bin/easytransferbutton.html (14 May 2012, date last 
accessed). 

DISCUSSION AND CONCLUSION 

Sequence logos provide a powerful way to visualize amino
acid preferences in a receptor binding motif, as well as
sequence conservation/variation and the location of functionally
essential residues in multiple sequence alignments.
Accurate estimation of a sequence motif is often
compromised by data redundancy and low numbers of
observations. Inappropriate handling of these issues can
lead to inaccurate estimation of the sequence motif and
a subsequently poor sequence logo representation. Moreover,
the majority of sequence logo webservers visualize
information related to amino acid depletion poorly,
since they focus on displaying the position-specific
enrichment of amino acids.

Here, we have proposed a novel sequence logo generator,
Seq2Logo, that aims at addressing these shortcomings and
allows non-expert end-users, via an easy-to-use web interface,
to generate accurate sequence logos from protein sequence
data. We have demonstrated that Seq2Logo can deal with
sequence redundancy and low numbers of observations in a
manner superior to that of other publicly available sequence
logo generators like Weblogo and EnoLOGOS. Besides
the conventional Shannon sequence logo, Seq2Logo also
incorporates distinct logotypes where depleted amino
acids are displayed on the negative y-axis. These logotypes
offer a unique possibility for Seq2Logo to display, for
instance, receptor-binding motifs in a format that highlights
both favored and disfavored amino acids at the different
positions in the motif.
A sequence profile is a powerful way to capture position-specific
information about sequence conservation/
variation within a protein family. Seq2Logo accepts
sequence profiles in the Blast format as input and can, in
a very simple and intuitive manner, be used in combination
with Blast as a tool to visualize sequence profiles and
identify functionally conserved/variable amino acids in
any given protein of interest.

Finally, to allow other servers dealing with multiple
sequence alignments and binding motifs to cooperate directly
with Seq2Logo and benefit from its improved
features, the server includes a form-handler that enables
communication with Seq2Logo via a simple HTML form.
This feature has allowed for a simple and effective improvement
to two of our own webservers, NNAlign (14)
and Blast2logo (www.cbs.dtu.dk/biotools/Blast2logo
(14 May 2012, date last accessed)), and we believe this
to be an additional feature that will become very useful
for other webserver developers within fields such as
receptor-binding motif characterization.

In its current form, Seq2Logo can only handle amino
acid input data. The reason for this limitation is that most
of its unique features, like pseudo count estimates from
Blosum substitution matrices and sequence weighting,
are specific for amino acid data. The ability to also handle
nucleic acids will be part of a future update of the
method.

In conclusion, we believe Seq2Logo to be an important 
and novel tool for non-expert users to construct accurate 
sequence logos describing receptor binding motifs and 
sequence variations in multiple sequence alignments. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Methods and Supplementary References 
[6-8,15,16]. 